Fix "could not refresh token" error resulting from concurrent CLI instances #8645
Idle Codex CLI instances can get stuck after another concurrently-running instance refreshes and rotates the shared ChatGPT refresh token: the idle process wakes up, gets a 401, and its in-memory refresh token is no longer valid, so refresh fails permanently.
This change makes 401 recovery resilient to concurrent token rotation by first syncing ChatGPT tokens from the configured credential store (file/keyring/auto) and retrying the request, then performing a network refresh only if needed (using the refresh token loaded from storage). It also prevents accidental cross-account/workspace switching by only adopting/refreshing when chatgpt_account_id matches the request’s auth snapshot, and adds bounded retries on transient auth.json parse errors to handle concurrent truncate+write. Added unit tests for the storage-sync outcomes.
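The recovery order described above can be sketched roughly as follows. This is a simplified, synchronous illustration and not the PR's code: `Tokens`, `Recovery`, and `recover_from_401` are hypothetical names, the stored tokens are passed in directly instead of being read from a credential store, and `refresh` stands in for the real network call.

```rust
/// Hypothetical, simplified stand-in for the ChatGPT tokens held by one CLI instance.
#[derive(Clone, Debug, PartialEq)]
pub struct Tokens {
    pub account_id: String,
    pub access_token: String,
    pub refresh_token: String,
}

/// What the caller should do after a 401 (illustrative names, not the PR's types).
#[derive(Debug, PartialEq)]
pub enum Recovery {
    /// Storage held different tokens for the same account; they were adopted, so retry.
    RetryWithStoredTokens,
    /// Storage had nothing newer, so a refresh was performed; retry.
    RetryAfterNetworkRefresh,
    /// Storage belongs to another account; nothing was adopted for this request.
    IdentityMismatch,
}

/// Assumed order: sync from storage first, refresh over the network only if needed.
/// `refresh` receives the refresh token that is current after the storage check.
pub fn recover_from_401(
    cached: &mut Tokens,
    stored: Option<Tokens>,
    refresh: impl FnOnce(&str) -> Result<Tokens, String>,
) -> Result<Recovery, String> {
    if let Some(stored) = stored {
        if stored.account_id != cached.account_id {
            // Never switch accounts in the middle of a request.
            return Ok(Recovery::IdentityMismatch);
        }
        if stored != *cached {
            // Another instance already rotated the tokens; adopt them and retry.
            *cached = stored;
            return Ok(Recovery::RetryWithStoredTokens);
        }
    }
    // Storage matches what we already hold, so the token really is stale.
    let refreshed = refresh(&cached.refresh_token)?;
    *cached = refreshed;
    Ok(Recovery::RetryAfterNetworkRefresh)
}
```

In this sketch the network refresh is reached only when storage has nothing different to offer, which is the case that previously failed permanently for idle instances.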
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Why is this required?
codex-rs/core/src/auth.rs
Outdated
        .await
        .map_err(RefreshTokenError::Transient)?
    else {
        return Ok(None);
Should a method be extracted here that returns an Option, so you can use ? to short-circuit all these checks and return Ok(None);s?
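A minimal sketch of that suggestion, under assumed types: `StoredAuth`, `StoredTokens`, `matching_refresh_token`, and `load_refresh_token` are hypothetical and only illustrate the shape. The checks move into a helper that returns `Option`, so each `?` replaces one explicit `else { return Ok(None); }` branch.

```rust
/// Hypothetical stored credentials; any part may be absent.
pub struct StoredAuth {
    pub tokens: Option<StoredTokens>,
}

pub struct StoredTokens {
    pub account_id: Option<String>,
    pub refresh_token: Option<String>,
}

/// Returns the stored refresh token only when every check passes.
/// Each `?` returns `None` early instead of a hand-written early return.
pub fn matching_refresh_token(
    stored: Option<&StoredAuth>,
    expected_account_id: &str,
) -> Option<String> {
    let tokens = stored?.tokens.as_ref()?;
    let account_id = tokens.account_id.as_deref()?;
    if account_id != expected_account_id {
        return None;
    }
    tokens.refresh_token.clone()
}

/// The fallible caller keeps `Result` for load errors and wraps the `Option` once.
/// `load` stands in for reading the credential store.
pub fn load_refresh_token(
    load: impl FnOnce() -> Result<Option<StoredAuth>, String>,
    expected_account_id: &str,
) -> Result<Option<String>, String> {
    let stored = load()?;
    Ok(matching_refresh_token(stored.as_ref(), expected_account_id))
}
```

The outer function still propagates real errors with `?` on the `Result`, while "nothing usable was stored" collapses into a single `Ok(None)`.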
codex-rs/core/src/client.rs
Outdated
        auth: &Option<crate::auth::CodexAuth>,
    ) -> Result<()> {
        if *refreshed {
        if recovery.refreshed_token {
Can we keep the refresh logic fully inside AuthManager so no external checking is needed? We can use some status endpoint to check whether the token is alive.
That will avoid every client having to maintain a complex recovery loop.
Clients already have a recovery loop. Implementing another recovery loop in the AuthManager seems a little redundant, but I agree that we can move more of the auth-specific recovery logic into AuthManager so it doesn't need to be repeated by each client.
Yes, I'm mostly worried about the fact that every client sending requests using token auth will need to reproduce this logic.
pakrym-oai
left a comment
Is there a way we can make both the refresh logic and the consumption logic simpler?
@pakrym-oai, I updated
        self.auth().map(|a| a.mode)
    }

    pub(crate) async fn sync_from_storage_for_request(
nit: remove pub
            Ok(UnauthorizedRecoveryDecision::Retry)
        }
        SyncFromStorageResult::SkippedMissingIdentity => {
            Ok(UnauthorizedRecoveryDecision::Retry)
Why are we retrying on missing identity? Isn't it fatal?
        Ok(SyncFromStorageResult::Applied { changed })
    }

    pub(crate) async fn refresh_token_for_request(
nit: remove pub
    };

    let storage =
        create_auth_storage(self.codex_home.clone(), self.auth_credentials_store_mode);
Should we use the load_auth logic here and compare CodexAuth instances directly?
Then we can use CodexAuth.refresh_token and avoid having another place where we update tokens.
        return Ok(SyncFromStorageResult::IdentityMismatch);
    }

    let changed = if let Some(current) = self.auth() {
Can we share this entire method's logic with reload()? It seems very similar except for the extra identity check.
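One way the sharing could look, as a sketch under assumptions: `Auth`, `Manager`, `SyncOutcome`, `apply_stored`, and `sync_for_request` are hypothetical simplifications, and the stored value is passed in rather than loaded. The identity check becomes an optional argument to a single shared core, so reload and the per-request sync differ only in what they pass.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Auth {
    pub account_id: String,
    pub access_token: String,
}

/// Illustrative outcome type, loosely modeled on the variants visible in the diff.
#[derive(Debug, PartialEq)]
pub enum SyncOutcome {
    Applied { changed: bool },
    IdentityMismatch,
    NothingStored,
}

pub struct Manager {
    pub cached: Option<Auth>,
}

impl Manager {
    /// Shared core: adopt `stored`, optionally only when its account matches.
    fn apply_stored(
        &mut self,
        stored: Option<Auth>,
        expected_account_id: Option<&str>,
    ) -> SyncOutcome {
        let Some(stored) = stored else {
            return SyncOutcome::NothingStored;
        };
        if let Some(expected) = expected_account_id {
            if stored.account_id != expected {
                // Leave the cache untouched when the account differs.
                return SyncOutcome::IdentityMismatch;
            }
        }
        let changed = self.cached.as_ref() != Some(&stored);
        self.cached = Some(stored);
        SyncOutcome::Applied { changed }
    }

    /// Plain reload: no identity check; reports whether the cache changed.
    pub fn reload(&mut self, stored: Option<Auth>) -> bool {
        matches!(
            self.apply_stored(stored, None),
            SyncOutcome::Applied { changed: true }
        )
    }

    /// Request-scoped sync: same logic plus the identity check.
    pub fn sync_for_request(&mut self, stored: Option<Auth>, expected: &str) -> SyncOutcome {
        self.apply_stored(stored, Some(expected))
    }
}
```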
    // Another instance may have refreshed and rotated the refresh token while we
    // were attempting our refresh. Reload and retry once if the stored refresh
    // token differs and identity still matches.
    let Some(stored_refresh_token) = load_stored_refresh_token_if_identity_matches(
Can we reuse sync_from_storage_for_request here, so we reload the entire auth object if possible and then call refresh token on it if needed?
        Ok(Some(tokens.refresh_token))
    }

    async fn load_auth_dot_json_with_retries(
Should the default storage implementation do retries?
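If retries did live in the storage layer, a bounded retry wrapper might look like the sketch below. `LoadError` and `load_with_retries` are hypothetical names, the version is synchronous where the PR's loader is async, and the attempt count and delay are caller-chosen assumptions. Only parse failures are retried, since those are what a concurrent truncate+write of auth.json would produce.

```rust
use std::thread::sleep;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum LoadError {
    /// The file was read but could not be parsed (possibly caught mid-write).
    Parse(String),
    /// Any other failure; retrying is not expected to help.
    Io(String),
}

/// Calls `load` at least once and at most `max_attempts` times,
/// retrying only on `LoadError::Parse` and sleeping `delay` between attempts.
pub fn load_with_retries<T>(
    max_attempts: u32,
    delay: Duration,
    mut load: impl FnMut() -> Result<T, LoadError>,
) -> Result<T, LoadError> {
    let mut attempt = 1;
    loop {
        match load() {
            Err(LoadError::Parse(_)) if attempt < max_attempts => {
                attempt += 1;
                sleep(delay);
            }
            // Success, a non-retryable error, or the final parse error.
            other => return other,
        }
    }
}
```

Placing this inside the default storage implementation would give every caller the same tolerance for partial writes, at the cost of a small added latency on genuinely corrupt files.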
    };

    if stored_account_id != expected_account_id {
        // Keep cached auth in sync for subsequent requests, but do not retry the in-flight
I don't understand this. If we refresh the cached auth, the next request will pick it up. The client pulls auth() for retries:
codex/codex-rs/core/src/client.rs
Lines 246 to 247 in d681ed2
    let auth = auth_manager.as_ref().and_then(|m| m.auth());
    let api_provider = self
This addresses #6498, which several users have reported.